Abstract

Human motion tracking is an important problem in computer vision. Most prior approaches have concentrated on efficient inference algorithms and prior motion models; however, few can explicitly account for the physical plausibility of recovered motion. The primary purpose of this work is to enforce physical plausibility in the tracking of a single articulated human subject. Towards this end, we propose a full-body 3D physical simulation-based prior that explicitly incorporates motion control and dynamics into the Bayesian filtering framework. We consider the human's motion to be generated by a “control loop”. In this control loop, Newtonian physics approximates the rigid-body motion dynamics of the human and the environment through the application and integration of forces. Collisions generate interaction forces to prevent physically impossible hypotheses. This allows us to properly model human motion dynamics, ground contact and environment interactions. For efficient inference in the resulting high-dimensional state space, we introduce an exemplar-based control strategy to reduce the effective search space. As a result, we are able to recover the physically plausible kinematic and dynamic state of the body from monocular and multi-view imagery. We show, both quantitatively and qualitatively, that our approach performs favorably with respect to standard Bayesian filtering methods.


1.  Introduction 

Physics plays an important role in characterizing, describing and predicting motion. Dynamical simulation allows one to computationally account for various physical factors, e.g., a person’s mass, interaction with the ground plane, friction, self-collisions or physical disturbances. A tracking system can take advantage of physical prediction to cope with incomplete information and reduce uncertainty. For example, ambiguities due to self-occlusions in monocular sequences could potentially be resolved by incorporating a passive dynamics-based (rag-doll) prediction. Pose changes that are unlikely or which violate physical constraints can be given lower weights, constraining the space of poses to search over and boosting performance. We claim that proper utilization of dynamics-based prediction will significantly improve the quality of motion tracking.

Figure 1. Incorporating physics-based dynamic simulation with joint actuation and dynamic interaction into Bayesian filtering. Illustration of the figure model, on the left, shows collision geometries of the figure segments (top-left), the joints and skeletal structure (middle-left), and the visual representation corresponding to an image projection (bottom-left). Most joints have 3 angular degrees of freedom (DOFs), except for the knee and elbow joints (1 angular DOF), spine joint and the clavicle joints (2 angular DOFs) and the root joint (3 linear and 3 angular DOFs). The figure’s motion is determined by its dynamics, actuation forces at joints (right-top) and surface interaction at contacts (right-bottom).

We propose a means for incorporating full-body physical simulation with probabilistic tracking. The tracked individual is modeled as an actuated articulated structure (“figure”) composed of three-dimensional rigid body segments connected by joints. Segments correspond to parts of the figure body, like the torso, head and limbs. The inference process uses Bayesian filtering to estimate the posterior probability distribution over figure states, consisting of recursive parameterizations of figure poses (relative joint DOF values and velocities) and associated information. The posterior distribution is represented by samples corresponding to individual state hypotheses. New state hypotheses are generated from past hypotheses by running motion predictors based on physical simulation and interpolation of training joint DOF data that define a prior over valid kinematic poses. Prediction algorithms can exploit knowledge about the environment and incorporate the intentions (policy, goal) of the tracked individual into the prediction process. We assume that the segment shapes, mass properties, collision geometries and other associated parameters are known and remain constant throughout the sequence.

We present results that demonstrate the utility of using a physics-based prior for tracking, compare the method's performance against other commonly used methods and show favorable performance under the effects of dynamic interaction exhibited in monocular and multi-view video.

2.  Related  Work 

There has been a vast amount of work in computer vision in the past 10-15 years on articulated human motion tracking (we refer the reader to [5] for a more detailed review of the literature). Most approaches [1, 3, 13] have concentrated on the development of efficient inference methods that are able to handle the high dimensionality of a human pose. Generative methods typically propose to either learn a low-dimensional embedding of the high-dimensional kinematic data and then attempt to solve the problem in this more manageable low-dimensional space [15], or alternatively advocate the use of prior models to reduce the effective search space in the original high-dimensional space [3]. More recent discriminative methods have attempted to go directly from image features to the 3D articulated pose from either monocular imagery [11, 14] or multiple views.

Producing smooth and accurate tracking remains a challenging problem, especially for monocular imagery. In particular, many of the produced results lack plausible physical realism and often violate the constraints imposed on the body by the world (resulting in out-of-plane rotations and foot skate). Such artifacts can be attributed to the general lack of physically plausible priors [2] (that can account for static and/or dynamic balance and ground-person-object interactions) which provide an untapped and very rich source of information.

The computer graphics and robotics communities, on the other hand, have been very successful in developing realistic physical models of human motion. These models for the most part have only been developed and tested in the context of synthesis (i.e., animation [6, 10, 19, 17]) and humanoid robotics [18]. Here, we introduce a method that uses a full-body physics-based dynamical model as a prior for articulated human motion tracking. This prior accounts for physically plausible human motion dynamics and environmental interactions, such as disallowing foot-ground penetration.

The earliest work on integrating physical models with vision-based tracking can be attributed to the influential work of Metaxas et al. [9] and Wren et al. [16]. In both [9] and [16] a Lagrangian formulation of the dynamics was employed, within the context of a Kalman filter, for tracking of simple upper body motions using segmented 3D marker [9] or stereo [16] observations. In contrast, we incorporate full-body human dynamical simulation into a Particle Filter, suited for multi-modal posteriors that commonly arise from ambiguities in monocular imagery. More recently, Brubaker et al. [2] introduced a low-dimensional biomechanically-inspired model that accounts for human lower-body walking dynamics. The low-dimensional nature of the model [2] facilitated tractable inference; however, the model, while powerful, is inherently limited to walking motions in 2D.

In this work, we introduce a more general full-body model that can potentially model a large variety of human motions. However, the high dimensionality of our model makes inference using standard techniques (e.g., particle filtering) challenging. To this end, we also introduce an exemplar-based prior for the dynamics to limit the effective search space and allow tractable inference in this high-dimensional space. Exemplar-based methods similar to ours have been successfully used for articulated pose estimation in [11, 15], dynamically adaptive animation [20], and humanoid robot imitation [7]. Here, we extend the prior exemplar methods [11] to deal with exemplars that account for single-frame kinematics and dynamics of human motion.

3.  Tracking  with  Dynamical  Simulation 

Tracking, including human motion tracking, is most often formulated as Bayesian filtering [4], which in the computer vision literature is often implemented in the form of a Particle Filter (PF). In a PF the posterior, $p(x_f \mid y_{1:f})$, where $x_f$ is the state of the body at time instant $f$ and $y_{1:f}$ is the set of observations up to time instant $f$, is approximated using a set of (typically) weighted samples/particles and is computed recursively,

$$p(x_{f+1} \mid y_{1:f+1}) \propto p(y_{f+1} \mid x_{f+1}) \int p(x_{f+1} \mid x_f)\, p(x_f \mid y_{1:f})\, dx_f.$$

In this formulation, $p(x_f \mid y_{1:f})$ is the posterior from the previous frame and $p(y_{f+1} \mid x_{f+1})$ is the likelihood that measures how well a hypothesis at time instant $f+1$ explains the observations; $p(x_{f+1} \mid x_f)$ is often referred to as the temporal prior and is the main focus of this paper.
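The recursion above can be illustrated with a minimal bootstrap particle filter step. This is only a sketch on a one-dimensional toy state, not the tracker described in this paper: the function name `particle_filter_step`, the toy prior and the toy likelihood are all assumptions made for illustration.

```python
import math
import random


def particle_filter_step(particles, weights, predict, likelihood, rng):
    """One recursive update: resample, predict, reweight."""
    n = len(particles)
    # Resample hypotheses in proportion to the previous posterior weights.
    resampled = rng.choices(particles, weights=weights, k=n)
    # Draw each new hypothesis from the temporal prior p(x_{f+1} | x_f).
    predicted = [predict(x, rng) for x in resampled]
    # Weight each hypothesis by the likelihood p(y_{f+1} | x_{f+1}).
    new_weights = [likelihood(x) for x in predicted]
    total = sum(new_weights)
    if total == 0.0:
        # Degenerate case: fall back to uniform weights.
        new_weights = [1.0 / n] * n
    else:
        new_weights = [w / total for w in new_weights]
    return predicted, new_weights


# Toy usage (hypothetical 1D state with an observation near x = 1).
rng = random.Random(0)
particles = [rng.uniform(-5.0, 5.0) for _ in range(200)]
weights = [1.0 / 200] * 200
toy_prior = lambda x, r: x + r.gauss(0.0, 0.1)
toy_likelihood = lambda x: math.exp(-0.5 * (x - 1.0) ** 2)
particles, weights = particle_filter_step(
    particles, weights, toy_prior, toy_likelihood, rng
)
```

In the setting of this paper, the `predict` callable is where a physics-based temporal prior would be plugged in, while the rest of the filter stays unchanged.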

The temporal prior is often modeled as a first or second order linear dynamical system with Gaussian noise. For example, in [1, 3] the non-informative smooth prior $p(x_{f+1} \mid x_f) = \mathcal{N}(x_f, \Sigma)$, which facilitates continuity in the recovered motions, was used; alternatively, constant velocity temporal priors of the form $p(x_{f+1} \mid x_f) = \mathcal{N}(x_f + \gamma_f, \Sigma)$ (where $\gamma_f$ is a scaled velocity, learned or inferred) have also been proposed [13] and shown to have favorable properties when it comes to monocular imagery. However, human motion, in general, is non-linear and non-stationary.
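For comparison with the physics-based prior developed below, the two linear-Gaussian priors can be sketched as samplers. The sketch assumes a diagonal covariance given as per-dimension standard deviations; the function names are illustrative and not taken from [1, 3, 13].

```python
import random


def smooth_prior(x, sigma, rng):
    """Sample x_{f+1} ~ N(x_f, diag(sigma^2)): no motion is predicted."""
    return [xi + rng.gauss(0.0, s) for xi, s in zip(x, sigma)]


def constant_velocity_prior(x, gamma, sigma, rng):
    """Sample x_{f+1} ~ N(x_f + gamma_f, diag(sigma^2))."""
    return [xi + gi + rng.gauss(0.0, s) for xi, gi, s in zip(x, gamma, sigma)]
```

Both samplers are stationary and linear in the state, which is exactly the limitation noted in the last sentence of the paragraph above.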

Physical Newtonian simulation is better suited as the basis for a temporal prior that addresses these issues. For simulation, our world abstraction consists of a known static environment and a loop-free articulated structure (“figure”) representing the individual to be tracked. We assume “physical properties” (mass, inertial properties, and collision geometries) are known for each rigid body segment. Given these properties and a state hypothesis at frame $f$, we use a constrained dynamics simulator within the “control loop” to predict the state at the next frame. Constraints are used to model various physical phenomena like interaction with the environment and to control the figure motion. Motion planning and control procedures incorporate training motion capture data in order to estimate the human’s next intended pose and produce corresponding motion constraints that would drive the figure towards its intended pose. Similar to earlier methods, we add Gaussian noise (with diagonal covariance) to the dynamics to account for observation noise and minor physical disturbances.

3.1.  Body  Model  and  State  Space 

Our figure (body) consists of 13 rigid body segments and has a total of 31 degrees of freedom (DOFs), as illustrated in Figure 1. Segments are linked to parent segments by either 1-DOF (hinge), 2-DOF (saddle) or 3-DOF (ball and socket) rotational joints to ensure that only relevant rotations about specific joint axes are possible. The root segment is “connected” to the world space origin by a 6-DOF global “joint” whose DOF values define the global figure orientation and position. The values of rotational joint DOFs are encoded using Euler angles. Collision geometries attached to individual segments affect physical aspects of the motion. Segment shapes define the visual appearance of the segments.

Joint DOF values concatenated along the kinematic tree define the kinematic pose, $q$, of the figure. Joint DOF velocities, $\dot{q}$, defined as the time derivatives, together with the kinematic pose $q$ determine the figure’s dynamic pose $[q, \dot{q}]$. The pose is considered invalid if it causes self-penetration of body parts and/or penetration with the environment, or if the joint DOF values are out of the valid ranges that are learned from the training motion capture data. These constraints on the kinematic pose allow us to reject invalid samples early in the filtering process.

The control policy information consists of the identifier $\pi$ of the policy type and the frame index $\nu$ at which the policy became effective. The policy type can be either active motion-capture-based ($\pi_A$) or passive ($\pi_P$). When the passive policy is in effect, no motion control takes place. The final figure state $x$ is defined as a tuple $[q, \dot{q}, \pi, \nu]$, where $q \in \mathbb{R}^{31}$, $\dot{q} \in \mathbb{R}^{31}$, $\pi \in \{\pi_A, \pi_P\}$ and $\nu \in \mathbb{N}$.
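As a small illustration, the state tuple can be written down as a typed record. The class and field names below are hypothetical, and treating frame index 0 as valid is an assumption; only the dimensionality (31 DOFs) and the two policy types come from the text.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple

N_DOFS = 31  # total degrees of freedom of the figure


class Policy(Enum):
    ACTIVE = "active"    # motion-capture-based policy (pi_A)
    PASSIVE = "passive"  # no motion control (pi_P)


@dataclass(frozen=True)
class FigureState:
    q: Tuple[float, ...]      # kinematic pose (joint DOF values)
    q_dot: Tuple[float, ...]  # joint DOF velocities
    policy: Policy            # control policy identifier (pi)
    policy_start_frame: int   # frame index at which the policy took effect (nu)

    def __post_init__(self):
        if len(self.q) != N_DOFS or len(self.q_dot) != N_DOFS:
            raise ValueError("q and q_dot must each have 31 entries")
        if self.policy_start_frame < 0:
            raise ValueError("policy_start_frame must be non-negative")
```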

3.2.  Likelihood 

The likelihood function measures how well a particular hypothesis explains image observations $I_f$. We employ a relatively generic likelihood model that accounts for silhouette and edge information in images [1]. We combine these two different feature types and across views (for multi-view sequences) using independence assumptions. The resulting likelihood, $p(I_f \mid q_f)$, of the kinematic pose, $q_f$, at frame $f$ can be written as

$$p(I_f \mid q_f) \propto \prod_{\text{views}} \left[ p_{sh}(I_f \mid q_f) \right]^{w_{sh}} \left[ p_{edge}(I_f \mid q_f) \right]^{w_{edge}}, \qquad (1)$$

where $p_{sh}(I_f \mid q_f)$ and $p_{edge}(I_f \mid q_f)$ are the silhouette and edge likelihood measures defined as in [1], and $w_{sh}$ and $w_{edge} = 1 - w_{sh}$ are a priori weighting parameters¹ for the two terms which account for the relative reliability between these two features.

Figure 2. Image Likelihood. The coupled observation $y_f$, consisting of two consecutive frames $I_f$ (upper left) and $I_{f+1}$ (lower left), matches the dynamic pose $[q_f, \dot{q}_f]$ well if features (silhouette and edges) at frame $f$ fit the kinematic pose $q_f$ (red pose) and features at frame $f+1$ fit the kinematic pose $q_{f+1}$ (green pose).

Because our state carries both kinematic and velocity information, we model the likelihood of the dynamic pose $[q_f, \dot{q}_f]$ using information extracted from both the current and the next frame; we refer to this as the coupled observation $y_f = [I_f, I_{f+1}]$. We define the likelihood of the coupled observation as a weighted product of two kinematic likelihoods from above:

$$p(y_f \mid x_f) \propto p(I_f \mid q_f)\, p(I_{f+1} \mid q_{f+1}), \qquad (2)$$

where $q_{f+1} = q_f + \Delta t \cdot \dot{q}_f$ is the estimate of the kinematic state/pose at the next frame, assuming $\Delta t$ is the time between the two consecutive frames (see Figure 2).
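The structure of Equations (1) and (2) can be sketched in log space, where the weighted product becomes a weighted sum. The silhouette and edge measures of [1] are not reproduced here; the sketch assumes they are supplied by the caller as strictly positive per-view scores, and all function names are illustrative.

```python
import math


def kinematic_log_likelihood(per_view_scores, w_sh=0.5):
    """Log of Eq. (1), up to an additive constant.

    per_view_scores: list of (p_sh, p_edge) pairs, one per camera view,
    each strictly positive.
    """
    w_edge = 1.0 - w_sh
    return sum(
        w_sh * math.log(p_sh) + w_edge * math.log(p_edge)
        for p_sh, p_edge in per_view_scores
    )


def extrapolate_pose(q, q_dot, dt):
    """First-order estimate q_{f+1} = q_f + dt * q_dot_f."""
    return [qi + dt * vi for qi, vi in zip(q, q_dot)]


def coupled_log_likelihood(q, q_dot, dt, score_frame_f, score_frame_f1, w_sh=0.5):
    """Log of Eq. (2): score q against frame f and q_{f+1} against frame f+1.

    score_frame_f and score_frame_f1 are caller-supplied callables that map
    a kinematic pose to a list of (p_sh, p_edge) pairs for that frame.
    """
    q_next = extrapolate_pose(q, q_dot, dt)
    return (
        kinematic_log_likelihood(score_frame_f(q), w_sh)
        + kinematic_log_likelihood(score_frame_f1(q_next), w_sh)
    )
```

Working in log space avoids numerical underflow when several views and two frames are multiplied together.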

This likelihood implicitly measures the velocity-level information. Alternatively, one can formulate a likelihood measure that explicitly computes the velocity information [2] (e.g., using optical flow) and compares it to the corresponding velocity components of the state vector. Notice that portions of our state, $x_f$, such as the control policy, are inherently unobservable and are assumed to have uniform probability with respect to the likelihood function².

1 For all of the experiments in this paper we use $w_{sh} = w_{edge} = 0.5$.

2 The resulting dual-counting of observations only makes the unnormalized likelihood more peaked, and can formally be handled as in [2].


Figure 3. Prediction Model: Control Loop. Components of the control loop and the data flow. Each iteration advances the figure state $[q, \dot{q}, \pi, \nu]$ by time $\Delta$ and records recent events $e$ so they can be accounted for by the motion planner at the next iteration. The little boxes within the components represent “memory locations” holding component-specific state information preserved across component exits.

3.3.  Prediction 

Prediction takes a potential figure state and estimates what its value at the next frame would be if the state’s evolution followed a certain motion model. We assume that human motion is governed by dynamics and by a thought process that tasks the figure “muscles” so that the desired motion would be performed. Our motion model idealizes this process and models the state evolution by executing the “control loop” outlined in Figure 3.

Given a figure state $x = [q, \dot{q}, \pi, \nu]$ and a vector of simulation events³ $e$ that occurred during the previous loop iteration, the motion planner decides what the next control policy $\pi$ will be and, depending on the policy, proposes the next desired kinematic pose $q_d$ that the figure should follow. This desired pose is then processed by the motion controller to set up a set of motion constraints⁴, $m$, that need to be honored by the dynamics simulator when updating the dynamic pose $[q, \dot{q}]$. Motion constraints implicitly generate motor forces to actuate the figure. As a simpler alternative to constraints, the motion controller could generate motor forces directly by a proportional-derivative servo [17].

The actual prediction consists of initializing the model from the given initial state $x$, looping through the control loop for the time duration of the frame, $\Delta t$ (this might take several iterations of size $\Delta \ll \Delta t$), and returning the state $x$ at the end of the frame.
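The sub-stepping described above can be sketched as a generic loop over planner, controller and simulator callables. This is a structural sketch only: the callable signatures and the one-dimensional toy components are assumptions, and a real system would delegate the simulation step to a physics engine.

```python
def predict_state(state, events, planner, controller, simulator, frame_dt, step_dt):
    """Advance `state` by one frame using several control-loop iterations."""
    n_steps = max(1, round(frame_dt / step_dt))
    for _ in range(n_steps):
        # Planner picks the policy and, if applicable, a desired pose.
        state, q_desired = planner(state, events)
        # With no desired pose the constraint set is empty (passive motion).
        constraints = controller(state, q_desired) if q_desired is not None else None
        # Simulator integrates the dynamics and reports new events.
        state, events = simulator(state, constraints, step_dt)
    return state, events


# Toy 1D components (hypothetical): state is a (q, q_dot) pair.
def toy_planner(state, events):
    return state, 1.0  # always ask for the pose q = 1


def toy_controller(state, q_desired, gain=10.0):
    q, _ = state
    return -gain * (q - q_desired)  # desired velocity


def toy_simulator(state, desired_velocity, dt):
    q, q_dot = state
    if desired_velocity is not None:
        q_dot = desired_velocity
    return (q + q_dot * dt, q_dot), []


final_state, final_events = predict_state(
    (0.0, 0.0), [], toy_planner, toy_controller, toy_simulator,
    frame_dt=1.0 / 60.0, step_dt=1.0 / 600.0,
)
```

Because the controller is called inside the loop, the constraints are recomputed against the updated pose at every sub-step.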

3.3.1  Motion  Planning 

3 Currently, corresponding to a binary indicator variable determining whether a collision with the environment has occurred.

4 In case no desired kinematic pose was proposed, $m = \emptyset$.

The motion planner, denoted by the function $h$ in Figure 3, allows the incorporation of different motion priors into the prediction process. It is responsible for picking a control policy $\pi$ (using the information about the figure state $x$ and the feedback $e$), updating the frame index $\nu$ since the policy was in effect and generating a desired kinematic pose $q_d$ for the motion controller using an algorithm specific to the policy, if applicable. New policies $\pi_{f+1}$ are sampled from simple distributions $p(\pi_{f+1} \mid \pi_f, e_f)$ that can depend on the duration of time the current policy $\pi_f$ has been in effect; for each potential value of $e_f$ and $\pi_f$ there is one such distribution⁵. Two control policies have been implemented so far, the active motion-capture-based policy and the passive motion policy.

Passive motion. This policy lets the figure move passively as if it were unconscious, and as a result no $q_d$ is generated when it is in effect. Its purpose is to account for unmodeled dynamics in the motion-capture-based policy and it should typically be activated for short periods of time.

Active motion. Our motion-capture-based policy actuates the figure so that it would perform a motion similar to the one seen in training motion capture data. We take an exemplar-based approach similar to that of [7, 11, 20]. To that end, we first form a database of observed input-output pairs (from training motion capture data) between a dynamic pose at frame $f$ and a kinematic pose at frame $f+1$, $\mathcal{D} = \{[q^*_f, \dot{q}^*_f], q^*_{f+1}\}_{f=1}^{n}$. For pose invariance to absolute global position and heading, the corresponding degrees of freedom are removed from $q^*_f$ and $\dot{q}^*_f$. Given this database, which can span training data from multiple subjects and activities, our objective is to determine the intended kinematic pose $q_d$ given a new dynamic pose $[q, \dot{q}]$. We formulate this as in [11] using a K nearest neighbors (k-NN) regression method, where a set of similar prototypes/exemplars to the query point $[q, \dot{q}]$ are first found in the database and then $q_d$ is obtained by weighted averaging over their corresponding outputs; the weights are set proportional to the similarity of the prototype/exemplar to the query point. This can be formally written as

$$q_d = \sum_{[q^*_f, \dot{q}^*_f] \in \text{neighborhood}[q, \dot{q}]} \mathcal{K}\big(d_f([q^*_f, \dot{q}^*_f], [q, \dot{q}])\big) \cdot q^*_{f+1},$$

where $d_f([q^*_f, \dot{q}^*_f], [q, \dot{q}])$ is the similarity measure and $\mathcal{K}$ is the kernel function that determines the weight falloff as a function of distance from the query point.

We use a similarity measure that is a linear combination of positional and velocity information,

$$d_f([q^*_f, \dot{q}^*_f], [q, \dot{q}]) = w_\alpha \cdot d_M(q, q^*_f) + w_\beta \cdot d_M(\dot{q}, \dot{q}^*_f),$$

where $d_M(\cdot)$ denotes a Mahalanobis distance between $q$ and $q^*_f$, and $\dot{q}$ and $\dot{q}^*_f$, respectively, with covariance matrices learned from the training data, $\{q^*_f\}_{f=1}^{n}$ and $\{\dot{q}^*_f\}_{f=1}^{n}$; $w_\alpha$ and $w_\beta$ are positive constants that account for the relative weighting of the two terms. For the kernel function we use a simple Gaussian, $\mathcal{K} = \mathcal{N}(0, \sigma)$, with empirically determined variance $\sigma^2$.
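The exemplar lookup can be sketched as kernel-weighted k-NN regression. The sketch makes several simplifying assumptions that are not stated in the text: covariances are diagonal (given as per-dimension variances), the kernel weights are normalized so that the result is a weighted average, and the database is a non-empty list of `(q, q_dot, q_next)` triples. All names and default parameter values are illustrative.

```python
import math


def mahalanobis_diag(a, b, variances):
    """Mahalanobis distance under a diagonal covariance."""
    return math.sqrt(sum((ai - bi) ** 2 / v for ai, bi, v in zip(a, b, variances)))


def knn_desired_pose(q, q_dot, database, var_q, var_qdot,
                     k=3, w_alpha=1.0, w_beta=1.0, sigma=1.0):
    """Estimate the desired pose q_d from the k most similar exemplars."""
    scored = []
    for ex_q, ex_qdot, ex_next in database:
        # Similarity: weighted sum of positional and velocity distances.
        d = (w_alpha * mahalanobis_diag(q, ex_q, var_q)
             + w_beta * mahalanobis_diag(q_dot, ex_qdot, var_qdot))
        scored.append((d, ex_next))
    scored.sort(key=lambda item: item[0])
    neighbors = scored[:k]
    # Gaussian kernel: weight falls off with distance from the query.
    weights = [math.exp(-0.5 * (d / sigma) ** 2) for d, _ in neighbors]
    total = sum(weights)
    if total == 0.0:
        # All neighbors are extremely far away: average them uniformly.
        weights = [1.0] * len(neighbors)
        total = float(len(neighbors))
    dim = len(neighbors[0][1])
    return [
        sum(w * nxt[i] for w, (_, nxt) in zip(weights, neighbors)) / total
        for i in range(dim)
    ]
```

A linear scan is used here for clarity; with a large motion capture database a spatial index would be the natural replacement for the sort.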

5  These  discrete  conditional  distributions  are  defined  empirically. 


3.3.2  Motion  Control 


The motion controller $g$ in Figure 3 conceptually approximates the human’s muscle actuation to move the current pose hypothesis $[q, \dot{q}]$ towards the intended kinematic pose $q_d$ when the figure state is updated by dynamics. We formulate motion control as a set of soft constraints on $q$ and $\dot{q}$. Each constraint is defined as an equality or inequality with a softness constant determining what portion of the constraint force should actually be applied to the constrained bodies. Constraints can also limit force magnitude to account for biomechanical properties of human motion, like muscle power limits or joint resistance.

Unlike traditional constraint-based controllers [8], we do not directly control (constrain) the position of the figure root so that global translation will result only from the figure’s interaction with the environment (contact)⁶. This introduces several problems that require a new approach to motion control. Consider the case where the desired kinematic pose $q_d$ is infeasible (e.g., causing penetration with the environment). Leaving the linear DOFs unconstrained, in this case, often leads to unexpected contacts/impacts with the environment during simulation which can affect the motion adversely⁷. To address these problems, we propose a new kind of hybrid constraint-based controller (see Figure 4) that aims to follow desired joint angles as well as trajectories of selected markers (points) defined on the figure segment geometries. The controller takes as input the dynamic pose $[q, \dot{q}]$ and the desired kinematic pose $q_d$ and outputs a set of desired angular velocities $\dot{q}_d$ obtained using inverse dynamics.

Given the desired kinematic pose $q_d$ and positions $z^j$ of markers on selected figure segments (toes), the controller first computes the marker positions with respect to the desired pose (using forward kinematics), $z^j_d$. These positions are then adjusted so that they do not penetrate the environment. The adjusted positions $z^j_d$ produce requests on desired positions of markers $z^j$ which are subsequently combined with requests on desired values of joint angles $q^k$ at other figure segments (with no associated markers). Finally, these requests are converted to constraints $m = \{\dot{q}^i = \dot{q}^i_d\}$ on angular velocities that are passed to the simulator.

6 However, the orientation of the root segment is constrained, which implements balancing. Although this is not physically correct, because the orientation can change regardless of the support from the rest of the body, it serves our purpose well.

7 For example, unwanted impacts at the end of the walking cycle will force the figure to step back instead of forward.

Figure 4. Motion Controller. Input kinematic pose $q$ determines the positions $z^j$ of markers on the feet (left), the desired kinematic pose $q_d$ their desired positions $z^j_d$ (middle). Desired positions are adjusted to prevent penetration with the ground and constraints on the marker velocities $\dot{z}^j$ and joint DOF derivatives $\dot{q}^k$ of the helper figure are formed (right). Superscripts $i$ index the figure’s angular DOFs, superscripts $j$ the markers and superscripts $k$ the angular DOFs of the figure segments that have no markers $j$ attached.

This process is implemented using first order inverse dynamics on a helper figure, where position and orientation requests serve as inverse dynamics goals; we fix the root segment in the helper figure to ensure that these goals cannot be solved by simple translation or rotation of the root segment. The process consists of the following steps. First, the pose of the helper figure is set up to mirror the current pose $[q, \dot{q}]$ of the original figure. Next, given the value of $c_a > 0$ determining how fast the controller should approach the desired values, the requests on desired positions of markers are converted to soft constraints on desired marker velocities $\dot{z}^j = -c_a \cdot (z^j - z^j_d)$, and the requests on desired joint angles at other segments are converted to soft constraints on desired joint angle velocities $\dot{q}^k = -c_a \cdot (q^k - q^k_d)$. These constraints are finally combined with additional constraints on joint angle limits $q^i \geq q^i_{min}$ and $q^i \leq q^i_{max}$; the constraints are solved and final desired angular velocities, $\dot{q}_d$, are obtained. The last step is implemented by using the facilities of the physics engine.
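The conversion of position requests into velocity requests can be sketched as follows. The sketch covers only the proportional step and the ground adjustment, under the assumptions of a flat ground plane and marker positions given as `(x, y, height)` tuples; solving the combined constraints on the helper figure is left to the physics engine and is not shown. The function names are hypothetical.

```python
def desired_velocity(current, desired, gain):
    """Proportional request: velocity = -gain * (current - desired)."""
    return -gain * (current - desired)


def marker_velocity_request(z, z_desired, ground_height, gain):
    """Velocity request for one marker, keeping the target above the ground.

    z and z_desired are (x, y, height) tuples; a flat ground plane at
    `ground_height` is assumed.
    """
    # Adjust the desired position so it does not penetrate the ground.
    adjusted = (z_desired[0], z_desired[1], max(z_desired[2], ground_height))
    return tuple(desired_velocity(c, d, gain) for c, d in zip(z, adjusted))


def joint_velocity_requests(q, q_desired, gain):
    """Velocity requests for joint angles on segments without markers."""
    return [desired_velocity(c, d, gain) for c, d in zip(q, q_desired)]
```

A larger gain makes the request approach the desired value faster, at the cost of demanding larger velocities from the simulator.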

3.3.3  Dynamical  Simulation 

The dynamical simulator, denoted by (with slight abuse of notation) function $f$ in the control loop, numerically integrates an input dynamic pose forward in time based on Newtonian equations of motion and specified constraints. We use the Crisis physics engine [21] which provides facilities for constraint-based motion control and implements certain features suitable for motion tracking. The simulator’s collision detection library is used to validate poses⁸.

The simulation state is advanced by time $\Delta$ by following standard Newton-Euler equations of motion, while obeying a set of constraints: the explicit motion control constraints $m$, soft position constraints $q^i \geq q^i_{min}$ and $q^i \leq q^i_{max}$ on angular DOFs $i$ implementing joint angle limits, and implicit velocity or acceleration constraints enforcing non-penetration and modeling friction. Because the constraints $m$ are valid only with respect to a specific dynamic pose, they have to be reformulated each time the state is internally updated by the simulator. As a result, the motion controller can be called back throughout the simulation process.

8  When  noise  is  added  to  a  kinematic  pose,  it  has  to  be  determined 
whether  the  proposed  pose  is  valid  according  to  the  metrics  discussed  in 
Section  3.1. 




This is illustrated by the corresponding arrows in Figure 3. Once the simulation completes, the dynamic pose $[q, \dot{q}]$ matching the resulting state of the physical representation is returned. In order to provide feedback about events in the simulated world for the motion planner (“perception”), recent simulation events (see footnote 3) are recorded into $e$, which is returned together with the updated pose.

4.  Experiments 

Datasets. In our experiments we make use of two publicly available datasets that contain synchronized motion capture (MoCap) and video data from multiple cameras (at 60 Hz). The use of this data allows us to (1) quantitatively analyze the performance (by treating MoCap as ground truth), and (2) obtain reasonable initial poses for the first frame of the sequence from which tracking can be initiated. The first dataset, used in [1], contains a single subject (L1) performing a walking motion with stopping, imaged with 4 grayscale cameras (see Figure 8). The second, the HumanEva dataset [12] (see Figure 7), contains three subjects (S1 to S3) performing a variety of motions (e.g., walking, jogging, boxing) imaged with 7 cameras (we, however, make use of the data from at most 3 color cameras for our experiments). Each dataset contains disjoint training and testing data, which we use accordingly.

Error. To quantitatively evaluate the performance we make
use of the metric employed in [1] and [12], where pose error
is computed as the average distance between a set of 15
markers defined at the key joints and end points of the limbs.
Hence, in 3D this error has an intuitive interpretation as the
average joint distance, in mm, between the ground truth
and recovered pose. In our monocular experiments, we use
an adaptation of this error that measures the average joint
distance with respect to the position of the pelvis, to avoid
biases that may arise due to depth ambiguities. For tracking
experiments, we report the error of the expected pose (see
footnote 9).
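
The error measure described above can be sketched in a few lines. This is only an illustration under stated assumptions: the marker ordering, the choice of index 0 for the pelvis marker, and the function names are hypothetical and are not taken from [1] or [12].

```python
import math

def euclidean(p, q):
    """Euclidean distance between two 3D points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def mean_marker_error(estimated, ground_truth):
    """Average marker-to-marker distance (same units as the input, e.g. mm)."""
    if len(estimated) != len(ground_truth):
        raise ValueError("marker sets must have the same size")
    total = sum(euclidean(p, q) for p, q in zip(estimated, ground_truth))
    return total / len(estimated)

def pelvis_relative_error(estimated, ground_truth, pelvis_index=0):
    """Same error after expressing every marker relative to its own pelvis.

    pelvis_index is a hypothetical convention chosen for this sketch.
    """
    def centre(markers):
        origin = markers[pelvis_index]
        return [tuple(c - o for c, o in zip(m, origin)) for m in markers]
    return mean_marker_error(centre(estimated), centre(ground_truth))

# Toy data: 15 markers; the estimate is the ground truth pushed 100 mm in depth.
truth = [(float(i), 2.0 * i, 0.0) for i in range(15)]
shifted = [(x, y, z + 100.0) for x, y, z in truth]

print(mean_marker_error(shifted, truth))      # penalizes the global depth offset
print(pelvis_relative_error(shifted, truth))  # insensitive to a global offset
```

In the toy data a pure depth offset dominates the absolute error but vanishes in the pelvis-relative variant, which is the motivation given above for using the latter in the monocular experiments.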

Prediction. The key aspect of our physics-based prior is
the ability to perform accurate, physically plausible predictions
of the future state based on the current state estimates.
First, we set out to test how our prediction model compares,
quantitatively, with the standard prediction models based on
stationary linear dynamics described in Section 3.

Figure 6 (right) shows the performance of the smooth prior
(No Prediction), the constant velocity prior, and individual
predictions based on the two control strategies implemented
within our physics-based prediction module. For all 4 methods
we use 200 frames of motion capture data from the L1
sequence to predict poses from 0.05 to 0.5 seconds ahead.


9  Other  error  metrics  such  as  optimistic  error  [1]  and  error  of  maximum 
a  posteriori  (MAP)  pose  estimate  produce  very  similar  results. 


Figure 5. Prediction Error. Errors in predictions (0.5 seconds
ahead) are analyzed over one walking cycle. Vertical
bars illustrate different phases of the walking motion: light blue
- foot hits the ground, light orange - change in the direction of
the arm swing. Notice that passive and dynamic predictions have
complementary behavior during different motion phases (right).


[Figure 6 panel titles: Average Prediction Error (0.25 seconds ahead); Average Prediction Error (0.1 velocity sigma)]


Figure 6. Average Prediction Error. Illustrated, on the right, is
the quantitative evaluation of 4 different dynamical priors for
human motion: smooth prior (No Prediction), constant velocity prior,
and (separately) active and passive physics-based priors
implemented here. On the left, performance in the presence of noise
is explored. See text for further details.


We  then  compare  our  predictions  to  the  poses  observed  by 
motion  capture  data  at  corresponding  times. 
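
The two linear baselines in this comparison are simple enough to sketch. The code below is a minimal illustration, not the evaluation code used for Figure 6: the pose is reduced to a flat coordinate vector, velocities are taken by finite differences at 60 Hz, and the sinusoidal "walk_like" sequence is hypothetical toy data standing in for the L1 MoCap frames. The physics-based predictors are not modeled here.

```python
import math

FPS = 60.0  # capture rate of the data, in frames per second

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def finite_difference_velocity(sequence, t):
    """Velocity at frame t estimated from the previous frame (units per second)."""
    return [(a - b) * FPS for a, b in zip(sequence[t], sequence[t - 1])]

def predict_smooth(pose, velocity, horizon):
    """Smooth prior ('No Prediction'): the pose is assumed to stay put."""
    return list(pose)

def predict_constant_velocity(pose, velocity, horizon):
    """Constant velocity prior: extrapolate linearly over `horizon` seconds."""
    return [p + horizon * v for p, v in zip(pose, velocity)]

def average_prediction_error(sequence, predictor, horizon):
    """Mean distance between predicted and observed poses `horizon` seconds later."""
    steps = round(horizon * FPS)
    errors = []
    for t in range(1, len(sequence) - steps):
        velocity = finite_difference_velocity(sequence, t)
        predicted = predictor(sequence[t], velocity, horizon)
        errors.append(euclidean(predicted, sequence[t + steps]))
    return sum(errors) / len(errors)

# Hypothetical toy sequence: one coordinate oscillating once per second, 200 frames.
walk_like = [[math.sin(2.0 * math.pi * t / FPS)] for t in range(200)]

for horizon in (0.05, 0.25, 0.5):
    smooth = average_prediction_error(walk_like, predict_smooth, horizon)
    const_vel = average_prediction_error(walk_like, predict_constant_velocity, horizon)
    print(f"{horizon:.2f} s  smooth={smooth:.3f}  constant velocity={const_vel:.3f}")
```

Sweeping the horizon in this way mirrors the 0.05 to 0.5 second range used in the experiment; the absolute numbers from the toy signal carry no meaning for the real data.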

For short temporal predictions all methods perform
well; however, as the predictions are made further into
the future, our active motion control strategy, based on
the exemplar-based MoCap method, significantly outperforms
the competitors. Overall, the active motion control strategy
achieves 29% lower prediction error than the constant
velocity prior (averaged over the range of prediction
times from 0.05 to 0.5 seconds).

Figure 6 (left) shows the effect of noise on the predictions.
For a fixed prediction time of 0.25 seconds, zero-mean
Gaussian noise is added to each of the ground truth
dynamic poses before the prediction is made. The performance
is then measured as a function of the noise variance.
While the performance of the constant velocity prior and the
passive motion prior degrades with noise, the error of
our active motion prediction stays low and flat.
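
One way to see why a linear extrapolation is sensitive to such noise is the sketch below. It is an assumption-laden toy, not the experiment itself: only the velocity of a single hypothetical coordinate is perturbed, the true motion is exactly linear, and the function name and parameters are invented for illustration.

```python
import random

def noisy_constant_velocity_error(sigma, horizon=0.25, trials=1000, seed=0):
    """Mean absolute error of a constant velocity prediction when the velocity
    estimate is corrupted by zero-mean Gaussian noise of standard deviation sigma."""
    rng = random.Random(seed)
    start = 0.0
    true_velocity = 1.0  # hypothetical coordinate moving at 1 unit per second
    target = start + horizon * true_velocity
    total = 0.0
    for _ in range(trials):
        noisy_velocity = true_velocity + rng.gauss(0.0, sigma)
        predicted = start + horizon * noisy_velocity
        total += abs(predicted - target)
    return total / trials

for sigma in (0.0, 0.1, 0.2, 0.4):
    print(sigma, noisy_constant_velocity_error(sigma))
```

Because the velocity noise is multiplied by the prediction horizon, the error of this baseline grows in proportion to the noise level, which is consistent with the degradation reported above for the constant velocity prior.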

Notice that the constant velocity prior performs similarly
to the passive motion prior; intuitively, this makes sense since
the constant velocity prior is an approximation to the passive
motion dynamics that does not account for environment
interactions. Since such interactions happen infrequently and
we are averaging over 200 frames, the differences between
the two methods are not readily observed, but are important
at the key instants when they occur (see Figure 5).

Tracking with multiple views. We now test the performance
of the Bayesian tracking framework that incorporates
the physics-based prior considered above in the context
of multi-view tracking using a 200-frame, 4-view image
sequence of L1. We first compare the performance of the
proposed physics-based prior method (L1) to two standard
Bayesian filtering approaches that employ smooth temporal
priors: Particle Filtering (PF) and the Annealed Particle
Filter with 5 levels of annealing (APF 5) (see footnote 10
for both). To make the comparison as fair as possible we use
the same number of particles (250; see footnote 11), the same
likelihoods, and the same interpenetration and joint limit
constraints in all cases; joint limit constraints
are learned from training data. The quantitative results are
illustrated in Figure 9 (left). Our method has 72% lower
error than PF and 47% lower error than APF, as well as
considerably lower variance. Qualitative visualization of the
results analyzed in Figure 9 is not shown due to lack of space;
typical performance on a HumanEva sequence (with error
93.4 ± 24.8) is illustrated in Figure 7.
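
For readers unfamiliar with the baseline, the sketch below shows one cycle of a generic bootstrap particle filter with a smooth (Gaussian diffusion) temporal prior, together with the expected pose used for reporting error. It is a textbook-style illustration and not the implementation of [1]: the one-dimensional state, the toy likelihood, the noise levels and all function names are assumptions made for this example.

```python
import math
import random

def particle_filter_step(particles, weights, likelihood, diffusion_sigma, rng):
    """One resample / diffuse / reweight cycle of a bootstrap particle filter."""
    count = len(particles)
    # Resample states in proportion to their current weights.
    resampled = rng.choices(particles, weights=weights, k=count)
    # Smooth temporal prior: new state = previous state + Gaussian noise.
    predicted = [[x + rng.gauss(0.0, diffusion_sigma) for x in state]
                 for state in resampled]
    # Reweight by the likelihood and normalize.
    raw = [likelihood(state) for state in predicted]
    total = sum(raw)
    if total > 0.0:
        new_weights = [w / total for w in raw]
    else:
        new_weights = [1.0 / count] * count  # degenerate case: uniform weights
    return predicted, new_weights

def expected_pose(particles, weights):
    """Weighted mean of the particle set."""
    dims = len(particles[0])
    return [sum(w * p[d] for p, w in zip(particles, weights)) for d in range(dims)]

# Toy run: a 1-D "pose" with a hypothetical Gaussian likelihood centred on 1.0.
rng = random.Random(0)
observation = 1.0

def toy_likelihood(state):
    return math.exp(-0.5 * ((state[0] - observation) / 0.2) ** 2)

particles = [[rng.gauss(0.0, 1.0)] for _ in range(250)]
weights = [1.0 / 250] * 250
for _ in range(10):
    particles, weights = particle_filter_step(
        particles, weights, toy_likelihood, 0.1, rng)

estimate = expected_pose(particles, weights)
print(estimate)
```

The physics-based prior replaces only the diffusion step of such a filter with a simulation-based prediction; the resampling and reweighting structure is what the "same number of particles, same likelihoods" comparison above holds fixed.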

We have also tested how the performance of our method
degrades with larger training sets that come from other
subjects performing similar (walking) motions (see Physics
S1-S3 L1). It can be seen that additional training data does not
noticeably degrade the performance of our method, which
suggests that our approach is able to scale to large datasets.
We also test whether or not our approach can generalize,
by training on data of subjects from the HumanEva dataset
and running on a different subject, L1, from the dataset of
[1] (Physics S1-S3). The results are encouraging in that we
can still achieve reasonable performance that has lower
error than PF and APF (noise and joint levels of which were
trained using subject specific data of L1). While, due to the
exemplar-based nature of our active controller, it is likely
that our method would not be able to generalize to unobserved
motions, our experiments tend to indicate that it can
generalize within observed classes of motions given a
sufficient amount of training data.

Monocular Tracking. The most significant benefit of our
approach is that it can deal with monocular tracking. Physical
constraints embedded in our prior help to properly place
the hypotheses and avoid overfitting of image evidence that
in the monocular case lacks 3D information (see Figure 8
(Physics)); the results from PF and APF, on the other hand,
tend to overfit the image evidence, resulting in physically
implausible 3D hypotheses (see Figure 8 (APF 5) bottom)
and leading to more severe problems with local optima (see
Figure 8 (APF 5) top). Figure 8 (Physics) bottom illustrates
the physical plausibility of the 3D poses recovered
using our approach. Quantitatively, on the monocular
sequence, our model has 71% lower error than PF and 74%
lower error than APF, with once again considerably lower
(roughly | to |) variance (see Figure 9 right).


10 We make use of the public implementation by Balan et al. [1] available
from http://www.cs.brown.edu/people/alb/.

11 In APF we use 250 particles for each annealing layer.


Figure 7. Multi-view Tracking. Tracking performance on the Jog
sequence of subject S3 from the HumanEva dataset; 250 particles
are used for tracking. Illustrated is the projection of the tracked
model into one of the 3 views used for inference.

Analysis of computation time. While the tracking framework
was implemented in Matlab, the physics prediction
engine was developed in C++. As a result, the overhead
imposed by the physics simulation and motion control is
negligible with respect to the likelihood computation (see
footnote 12). The overhead imposed by the motion planning
is a function of the number of training examples; in our
experiments it corresponds to 11-20%. Sub-linear approximations
to k-NN regression [11] can make this more tractable for
large datasets. The raw per-particle computation times, in
seconds, for each of the approaches are: PF - 0.0280, APF 5 -
0.1525, Physics (no motion planning) - 0.0560, Physics L1
- 0.0624, Physics S1, S2, S3, L1 - 0.0672.
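
The 11-20% range can be checked directly against the per-particle timings listed above. The short calculation below uses only those published numbers; the dictionary keys and the helper name are chosen for this illustration.

```python
# Per-particle computation times in seconds, as listed in the text.
per_particle_seconds = {
    "PF": 0.0280,
    "APF 5": 0.1525,
    "Physics (no motion planning)": 0.0560,
    "Physics L1": 0.0624,
    "Physics S1, S2, S3, L1": 0.0672,
}

baseline = per_particle_seconds["Physics (no motion planning)"]

def planning_overhead(name):
    """Relative cost added by motion planning over physics without planning."""
    return (per_particle_seconds[name] - baseline) / baseline

for name in ("Physics L1", "Physics S1, S2, S3, L1"):
    print(f"{name}: {100.0 * planning_overhead(name):.1f}% overhead")

# Ratio of physics (no planning) to PF cost per particle.
print(baseline / per_particle_seconds["PF"])
```

The two overheads come out at roughly 11% and 20%, matching the stated range, and the physics variant without planning costs about twice as much per particle as PF, in line with the two-frame likelihood evaluation noted in footnote 12.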

5.  Discussion  and  Conclusions 

We presented a framework that incorporates full-body,
physics-based constrained simulation, as a temporal
prior, into articulated Bayesian tracking. As a result, we
are able to account for the non-linear, non-stationary dynamics
of the human body and interactions with the environment
(e.g., ground contact). To allow tractable inference we also
introduce two controllers: a novel hybrid constraint-based
controller, which uses motion-capture data to actuate the
body, and a passive motion controller. Using these tools, we
illustrate that our approach can better model the dynamical
process underlying human motion, and achieve physically
plausible tracking results using multi-view and monocular
imagery. We show both quantitatively and qualitatively that
the resulting tracking performance is more accurate and
natural (physically plausible) than results obtained using
standard Bayesian filtering methods such as Particle Filtering
(PF) or Annealed Particle Filtering (APF). In the future, we
plan to explore richer physical models and control strategies,
which may further loosen the current reliance of our
method on motion-capture training data.


12 The likelihood evaluations in our framework, however, involve
computing the likelihood over two frames (rather than one in PF) and hence
are twice as expensive; the number of likelihood evaluations in APF is a
function of the number of layers.


Figure 8. Monocular Tracking. Visualization of performance on
a monocular walking sequence of subject L1. Illustrated is the
performance of the proposed method (Physics) versus the Annealed
Particle Filter (APF 5); in both cases with 1000 particles. The top
row shows projections (into the view used for inference) of the
resulting 3D poses at 20-frame increments; the bottom row shows the
corresponding rendering of the model in 3D along with the ground
contacts. Our method, unlike APF, does not suffer from out-of-plane
rotations and has a consistent ground contact pattern. For
quantitative evaluation see Figure 9 (right).
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